Introduction


• It is paradoxical that, after the pioneering work by von Neumann,
Eckert, and others on how to make reliable systems from
unreliable parts, the result today is just the reverse.
• The purpose of this talk is to explain how this state was reached
and what can be done to achieve reliability.
• The key concept missing from the early work was the need
to make all faults corruption-free, and to make replacement of a
failing component a simple process.
The Problem


Background
• Digital systems as presently implemented are nearly all unsafe.
• “Silent” errors occur throughout the system.
• These may be bit flips in data, in addresses, in read/write
commands, or elsewhere.
• Error detection is often lacking, is ignored, or leads to system
malfunction. “Fault-tolerant” systems attempt to mitigate the
effects of detected errors, but they fail to yield correct results.
• A lower-bound estimate for one kind of silent error is about
0.64 checksum mismatches per 100 drives per month[3][2].
• The result is that stored data are silently and irreversibly lost, and
incorrect computations are treated as correct.
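To give a sense of scale, the quoted rate can be applied to a fleet of drives. The sketch below is illustrative only: the fleet size, the observation period, and the assumption of independent (Poisson) events are hypothetical and do not come from the cited studies.

```python
import math

# Lower-bound rate quoted above: 0.64 checksum mismatches per 100 drives per month.
RATE_PER_DRIVE_MONTH = 0.64 / 100


def expected_mismatches(drives: int, months: float) -> float:
    """Expected number of checksum mismatches, scaling the rate linearly."""
    return RATE_PER_DRIVE_MONTH * drives * months


def prob_at_least_one(drives: int, months: float) -> float:
    """P(at least one mismatch), ASSUMING independent Poisson arrivals."""
    return 1.0 - math.exp(-expected_mismatches(drives, months))


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives observed for 12 months.
    print(expected_mismatches(1000, 12))
    print(prob_at_least_one(1000, 12))
```

Because the quoted figure is a lower bound, the numbers printed here should be read as lower bounds as well.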
This problem has been a part of digital systems since their
beginnings.

Von Neumann discussed the theory in his 1952 Caltech lecture:
“Probabilistic Logics and the Synthesis of Reliable Organisms
from Unreliable Components.”[10]

J. Presper Eckert recognized the practical problem and
implemented checking (mainly duplicate comparison) in EDVAC
and subsequent UNIVAC systems. See Section 8 in [4] and [11].

Seymour Cray expressed an opposing view: “parity is for farmers.”
However, he also subjectively originated the formal definition of
digital logic in RTL form, first defined by Babbage[1].


Why are Digital Systems Unique?


• The defining property of digital systems is exact bit reproducibility.
This is achieved by means of restoring logic, as von Neumann
explained. Reliability of digital systems is not relative.
• All other human artifacts, such as transportation systems,
telephones, or buildings, are deemed to be “reliable” if they
function as intended within appropriate tolerances. Changes in
the state of such systems have localized and locally connected
effects.
• There is no measure of “locality” or “significance” in digital
systems. All bits are “the same” to the system.
• Current error checking does not usually prevent corruption.
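The point about “significance” can be illustrated with a small experiment. The sketch below is not from the talk; it assumes IEEE 754 double-precision values and a hypothetical helper, flip_bit, that inverts one bit of the stored encoding. The hardware treats both flips identically, yet one changes the value in the sixteenth decimal place and the other turns 1.0 into infinity.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit (0 = least significant) of its
    IEEE 754 double encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


if __name__ == "__main__":
    print(flip_bit(1.0, 0))   # lowest mantissa bit
    print(flip_bit(1.0, 62))  # highest exponent bit
```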
Examples of Error Checking
• IBM disk drives circa 1968. (Address parity check.)
• Magnetic tape transfers: IBM and GE. (Incomplete, off by one.)
• FTP transfers. (Incomplete, index errors.)
• Memory parity checking. (Ignored.)
• ECC checking. (Unmanaged.)
• The Unix/Linux “anything goes” principle.
• RAID systems: (1) backup failure, (2) device ID error.
• In code that has been examined, the error density is by far the
highest in the error handling code.
• All errors must be detected and contained within a defined scope,
before any data propagation that could cause corruption.[9]
• All data structures must be continuously in a consistent state.
(Copy write and toggle.)
• The currently pervasive filesystem designs, based on the Unix
“unidentified record” approach and inconsistency, are inherently
unsafe (as fsck consistently proves).
• Filesystems based on continuous consistency and self-identifying
records have been implemented as described by Morris[6]. The
conversion to such a filesystem will be necessary at some time, as
will some form of monotonic database. (See Iliffe[5] for a more
general discussion of System principles.)
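One possible reading of “continuously consistent” storage with self-identifying records can be sketched as follows. This is an illustration under stated assumptions, not the design described by Morris[6]: the record layout, the MAGIC marker, the use of CRC-32, and the TwoSlotStore class are all hypothetical. Each record carries its own identity and check, a new version is written to the slot that does not hold the current version, and the higher sequence number acts as the toggle.

```python
import struct
import zlib

MAGIC = b"REC1"                    # hypothetical record-type marker
HEADER = struct.Struct(">4sIQI")   # magic, record id, sequence, payload length


def encode(record_id: int, seq: int, payload: bytes) -> bytes:
    """Build a self-identifying record: header + payload + CRC-32 of both."""
    body = HEADER.pack(MAGIC, record_id, seq, len(payload)) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def decode(blob: bytes, record_id: int):
    """Return (seq, payload) if blob is a valid record with this id, else None."""
    if len(blob) < HEADER.size + 4:
        return None
    magic, rid, seq, length = HEADER.unpack_from(blob)
    end = HEADER.size + length
    if magic != MAGIC or rid != record_id or len(blob) < end + 4:
        return None
    (crc,) = struct.unpack_from(">I", blob, end)
    if crc != zlib.crc32(blob[:end]):
        return None
    return seq, blob[HEADER.size:end]


class TwoSlotStore:
    """Keeps two copies; an update never overwrites the current valid copy."""

    def __init__(self, record_id: int, initial: bytes):
        self.record_id = record_id
        self.slots = [encode(record_id, 1, initial), b""]

    def _current(self):
        """Return (seq, slot index, payload) of the newest valid copy."""
        valid = []
        for index, blob in enumerate(self.slots):
            decoded = decode(blob, self.record_id)
            if decoded is not None:
                valid.append((decoded[0], index, decoded[1]))
        if not valid:
            raise IOError("no valid copy of record %d" % self.record_id)
        return max(valid, key=lambda entry: entry[0])

    def read(self) -> bytes:
        return self._current()[2]

    def write(self, payload: bytes) -> None:
        seq, index, _ = self._current()
        # Write into the OTHER slot; the higher sequence number is the toggle.
        self.slots[1 - index] = encode(self.record_id, seq + 1, payload)
```

If a write is torn or the new slot is corrupted, the check fails for that slot and a reader falls back to the previous version, so there is no moment at which the store holds only an inconsistent copy.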


A Solution


Attributes of a Solution
• Data through-checking: no unchecked transfers. No “single valid
copy.”
• All staticized data covered by error correction codes.
• Inhibit propagation of any error by check before transfer.
• Precise error reporting to define and isolate failing components.
• System controller manages error reports and component
replacements.
• Demonstration of correctness by full fault injection.
• Improved control software that matches fully checked hardware
(VM-based).
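Three of the attributes above (check before transfer, precise error reporting, and demonstration by fault injection) can be shown together in a small software model. This is a minimal sketch under stated assumptions: CRC-32 stands in for whatever check the hardware would use, the component name is a hypothetical label, and the injection covers only single-bit faults in one frame rather than the full fault injection the talk calls for.

```python
import zlib


class TransferError(Exception):
    """Reports which (hypothetical) component delivered bad data."""

    def __init__(self, component: str):
        super().__init__(f"check failed at component {component!r}")
        self.component = component


def seal(data: bytes) -> bytes:
    """Append a CRC-32 so the data can be checked at every hop."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def checked_transfer(frame: bytes, component: str) -> bytes:
    """Verify the frame BEFORE passing it on; never forward unchecked data."""
    if len(frame) < 4:
        raise TransferError(component)
    data, crc = frame[:-4], frame[-4:]
    if zlib.crc32(data).to_bytes(4, "big") != crc:
        raise TransferError(component)
    return frame


def inject_all_single_bit_faults(frame: bytes, component: str) -> int:
    """Flip every bit in turn; return how many faults escaped detection."""
    escaped = 0
    for bit in range(len(frame) * 8):
        bad = bytearray(frame)
        bad[bit // 8] ^= 1 << (bit % 8)
        try:
            checked_transfer(bytes(bad), component)
            escaped += 1
        except TransferError:
            pass
    return escaped
```

A run such as inject_all_single_bit_faults(seal(b"payload"), "bus0") should report zero escapes, and each raised TransferError names the component at which the check failed.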
• An INCITS standard from 2003 (T10 Data Integrity Field) provides
an industry standard for dealing with the SCSI part of this
problem. The 2003 DIF standard has been in the Linux kernel
since 3.4, provided by Martin Petersen[7][8]. Disk manufacturers
are providing the space for a CRC in sectors (512B+8 or 4KB+8).
• IBM has patents dating from 2011 which are based on the T10
specification but claim additional capabilities. Fusion-io has about
160 patents and Intel about 14, one of which seems to just claim
the T10 specification. The total number of patents referencing T10
is something like 600. No sign of Google.
• Much more must be defined, made standard, and implemented
to yield safe systems. I am not aware of any such work which is
openly available.
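The extra 8 bytes per sector mentioned in the first point above can be sketched as follows. This is a simplified model, not a conforming implementation: it assumes the commonly described layout of a 2-byte guard tag (a CRC-16 with polynomial 0x8BB7), a 2-byte application tag, and a 4-byte reference tag holding the low 32 bits of the logical block address. The function names are hypothetical.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def make_pi(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte field: guard (CRC), application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify(sector: bytes, pi: bytes, expected_lba: int) -> None:
    """Raise ValueError if the data or its address does not match the field."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    if guard != crc16_t10dif(sector):
        raise ValueError("guard tag mismatch: data corrupted")
    if ref != expected_lba & 0xFFFFFFFF:
        raise ValueError("reference tag mismatch: wrong block")
```

The guard tag catches corrupted data, while the reference tag catches intact data that was read from or written to the wrong block, which is the kind of address error listed among the earlier examples.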


Summary


• Current digital systems are unsafe.
• Data are being silently corrupted at a high rate (> 0.64 checksum
mismatches per 100 drives per month)[3][2].
• No viable mechanisms are in place to detect or recover from the
corruption.
• However, it is not sufficient to claim reliability or correctness based
on design, implementation, and test: correct operation in the
presence of errors must be demonstrable at any time during
normal system operation.